(57) Abstract: Schema-driven XML parsing techniques allow an XML parser to optimize its parsing process by composing parse
and to dynamically generate parsing code components based on XML schema definition for the targeted XML document. These
techniques reduce the XML parsing time and reduce the memory requirement during parsing process. Further, a reconfigurable
parser is provided which is guided during parsing of the XML document by XML element lexicographical information and state
transition information extracted from a schema associated with the XML document. Pre-allocated element object pools may be
provided based on the schema analysis to reduce the requirements for dynamic memory allocation and de-allocation operations.

METHOD AND APPARATUS FOR SCHEMA-DRIVEN XML PARSING OPTIMIZATION

CROSS REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. provisional application no. 60/516,037 
filed on October 30, 2003, incorporated herein by reference. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention

The present invention relates to methods and systems for data exchange in an 
information processing system. In particular, the present invention relates to providing an 
optimized parser for processing a structured document (e.g., an XML document) in an
information processing system. 

2. Discussion of the Related Art 

XML is a platform-independent text-based document format¹ designed to be used in
structured documents maintained in an information processing system. XML documents
(e.g., forms) have become the favored mechanism for data exchange among application
programs sharing data over a network (e.g., the Internet). XML documents have the
advantages that (a) the information in an XML document is extensible (i.e., an application
program developer can define a document structure using, for example, an XML schema
description), and (b) through the XML schema, an application developer can control the
range of values that can be accepted for any XML element or attribute in the structured
document. For example, in an XML schema-defined form for a pair of shoes, the application 
program developer may constrain the shoe size attribute accepted by the form to be between 
5 and 12. As a result, the form would reject as invalid input a shoe size of 100. 

Because of these advantages, XML is widely used in consumer application programs. 
However, additional processing overhead is imposed on the application program to allow 
XML to be read and edited easily by a human using a word processor as interface, because 
the structured data in an XML document are required to be parsed by the application into 
representations that can be manipulated in the computer by the application program. Parsing 
requires intensive computational resources, such as CPU cycles and memory bandwidth, as
the application program processes the XML elements or attributes one character at a time, in
addition to implementing the higher level processing requirements of the XML schema. In a
typical XML document, there can be a large number of elements and attributes which are
defined in the schema using different data types and constraints. Character-matching is not
efficient in existing hardware implementations, such as those based on IA32 and ARM
architectures.

¹ In this description, the platform-independent text-based document format means a text format for defining a
document which is independent of the underlying software platform (e.g., the operating system), the underlying
hardware platform, or both.

WO 2005/043327    PCT/US2004/036054

Parsers for documents written in numerous languages have been developed and used 
throughout the history of computers. For example, the first widely accepted parsers (which 
also validate) for XML are based on the W3C Document Object Model (DOM). DOM
renders the information in an XML document into a tree structure. Thus, a parser based on
DOM constructs a "DOM tree" in memory to represent the XML document, as it reads the
XML document. The DOM tree is then passed to the application program which traverses the
DOM tree to extract its required information. Constructing a DOM tree in memory is not
only time-consuming, it requires a large amount of memory. In fact, the memory occupied by
a DOM tree is usually 5-10 times greater than that occupied by the original XML
document. One optimization constructs a partial DOM tree in memory as needed to reduce
the memory requirement and the processing time.

Alternatively, an XML document may be parsed based on a streaming model. Parsers 
using the streaming model include SAX and Pull. Under the streaming model, rather than a
parse tree, a parser outputs a continuous stream of XML elements, together with the values of
their attributes, as the XML document is parsed. Typically, such a parser reads from the
XML document one XML element at a time, and passes to the consuming application the
values of the element and its associated attributes. Although a streaming-based parser is
efficient in its memory and processing speed requirements, such a parser merely tokenizes a
string into segments of text without interpretation. The interpretation of data contained in
each text segment is entirely left to the consuming application program. Thus, the burden of
XML processing, which is to provide data in an XML document to the application program
in a manner that can be readily used by the application program, is shifted from the parser
to the consuming application program.

A parser may or may not validate an XML document. Validation is the process by
which each parsed XML element is compared against its definition in an XML
schema (e.g., an XML DTD file). Validation typically requires string pattern-matching as the
validation program searches the multiple element definitions in the XML schema. A
conventional approach to simplify validation is to convert the definitions of an XML schema
into component models, expressed as a series of Java bean classes. An application program
may then check the XML elements using methods provided in the Java bean classes. While
schema conversion methods may speed up both the parsing and the validation processes to
some limited degree, such conversions do not provide the fast string pattern-matching desired
in XML parsing and validating.

As is apparent from the above, XML parsing involves a substantial amount of string-matching
operations, which are the most CPU intensive operations in XML parsing. Further,
the memory requirements of parsing XML elements also lead to a substantial amount of
inefficient memory allocation and de-allocation operations.

Summary

The present invention provides an XML validating parser that can dynamically
generate executable parsing codes based on information extracted from an XML schema
document that is either stored locally or obtained from a remote machine via a network.
According to another aspect of the invention, a schema-based, reconfigurable parser is 
provided. 

In one embodiment, each XML element of an XML document is parsed and validated
using a dedicated executable parsing code ("parselet"), which navigates the structure of the
XML element, its attribute values and constraints to validate the element. If the element is
valid, the examined XML element is passed to a consuming application program requiring
the XML document to be processed. Otherwise, an invalid exception is raised and the
consuming application program is notified. Because parsing in this instance is performed in a
compiled executable parselet, parsing is faster than the interpretive parsers of the prior art and
the memory requirement for string matching can be much reduced.

According to one embodiment of the present invention, a lexicographical analysis of
the XML elements is performed in advance for a given XML schema to provide: (1) state-transition
sequence information, and (2) element and attribute lexicographical distance
information. The transition sequence information can be used to guide the parser as to the
XML elements that may be expected to follow according to the given schema. The element
and attribute lexicographical distance measures a minimal lexicographical distance between
two strings (i.e., the smallest indices in the strings sufficient to identify and distinguish the
strings). This information is useful for guiding the parser to identify the element or attribute
of the XML document using the minimal amount of string comparison.
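By way of illustration only, the lexicographical distance described above may be sketched as follows. The class name "LexDistance" and its methods are hypothetical and do not appear in the appendices; the sketch assumes the names have already been sorted, so that the longest prefix a name shares with any other name is shared with one of its neighbors.

```java
// Hypothetical sketch: how many leading characters must be compared to tell
// schema names apart. Names and structure are assumptions, not the patent's code.
final class LexDistance {

    /** Index of the first differing character; the shorter length if one name is a prefix of the other. */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return n;
    }

    /**
     * For each name in a lexicographically sorted array, the number of leading
     * characters sufficient to distinguish it from its neighbors. When a name is
     * a prefix of its neighbor, the whole name is needed (and the parser must
     * still check the delimiter that follows it).
     */
    static int[] distinguishingLengths(String[] sortedNames) {
        int[] need = new int[sortedNames.length];
        for (int i = 0; i < sortedNames.length; i++) {
            int d = 0;
            if (i > 0) {
                d = Math.max(d, firstDifference(sortedNames[i - 1], sortedNames[i]));
            }
            if (i + 1 < sortedNames.length) {
                d = Math.max(d, firstDifference(sortedNames[i], sortedNames[i + 1]));
            }
            need[i] = Math.min(d + 1, sortedNames[i].length());
        }
        return need;
    }
}
```

Under these assumptions, "billTo" and "comment" are each identified by their first character, while "shipDate" and "shipTo" must be compared through their fifth character.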

In one embodiment of the present invention, pools of XML element objects having
pre-determined element-attribute structures are created when the parser is instantiated, which
are dynamically managed so that the sizes of the pools vary as needed. A schema analysis
method provides memory requirement information to allow a parse tree of the XML
document to be built in memory as a DOM object using objects in the element object pools.
The schema analysis also provides information for managing the sizes of element pools and
the type of element objects in each pool. Having element object designs in the pools alleviates
the parser's memory management requirements.

A parselet of the present invention is a compiled, executable code that executes much
faster than a corresponding interpretive code which examines the schema tree and XML
document tree at run time. In addition, the present invention provides multiple parselets for
different XML elements, so that parallel processing of multiple XML elements
simultaneously is possible. The compiled parselets may be used for multiple XML
documents based on the same XML schema.

An XML schema-driven parser of the present invention may be configurable to output
XML elements in the form of a DOM tree (or a similar parse tree), or in a stream, as
appropriate, according to the consuming application program. The parse tree of the present
invention need not provide in memory an entire DOM tree. The present invention allows a
partial parse tree to be constructed on demand, including as little as a single element. Thus,
the memory requirement may be reduced significantly while still allowing application
programs to access the XML document using DOM APIs.

By using minimal lexicographical distances between XML elements, string-matching
operations are significantly reduced, thus significantly reducing the parser's demand on
computational power.

By using pools of XML elements, a significant amount of dynamic memory allocation
operations may be avoided during parsing. When an element object is no longer needed (e.g.,
in the case of streaming-based parsing), the element object is returned to the respective pool
to be reused instead of being de-allocated. As a result of maintaining element pools,
significant reductions in the requirements on CPU and memory resources are achieved. For
example, the need for a garbage collector process, which does not reclaim memory from
finalized objects immediately, may be avoided. Avoiding the need for a garbage collection
process also reduces the requirements on the CPU.
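The pooling behavior described above may be sketched, under stated assumptions, as follows. The class "ElementPool" and its method names are hypothetical; the sketch assumes a single-threaded parser and leaves resetting a returned object's contents to the caller.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical sketch of a pre-allocated, growable pool of element objects.
final class ElementPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int created;

    /** Pre-allocates initialSize objects when the pool (and thus the parser) is set up. */
    ElementPool(Supplier<T> factory, int initialSize) {
        this.factory = factory;
        for (int i = 0; i < initialSize; i++) {
            free.push(factory.get());
            created++;
        }
    }

    /** Hands out a pooled object, growing the pool only when it is empty. */
    T acquire() {
        if (free.isEmpty()) {
            created++;
            return factory.get();
        }
        return free.pop();
    }

    /** Returns an object for reuse instead of leaving it to be de-allocated. */
    void release(T obj) {
        free.push(obj);
    }

    /** Total number of objects ever allocated by this pool. */
    int created() {
        return created;
    }
}
```

In this sketch an object released after use is the next one handed out, so a streaming parse that holds one element at a time allocates nothing beyond the initial pool.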

The present invention is better understood upon consideration of the detailed
description below and the accompanying drawings.

Brief Description of the Drawings 

Figure 1 is a block diagram of schema-driven parser generator system 120, according
to one embodiment of the present invention.

Figure 2 is a flow chart illustrating the operations of schema-driven parser generator
system 120 under push mode.

Figure 3 summarizes the operations of schema-driven parser generator system 120
under push mode.

Figure 4 is a flow chart illustrating the operations of schema-driven parser generator
system 120 under pull mode.

Figure 5 summarizes the operations of schema-driven parser generator system 120
under pull mode.

Figure 6 shows dynamically reconfigurable parser system 620, according to another
aspect of the present invention.

Detailed Description of the Preferred Embodiments

The present invention provides a validating parser that validates an XML document as
it is parsed, thereby reducing the validating and parsing time requirements and the memory
bandwidth requirement. The present invention may be applied to implement a parser code
generator and a reconfigurable parser.

According to one embodiment of the present invention, based on a specific XML
schema, a validating parser generator dynamically generates an executable parser code
("parselet") for implementing parsing of a specific XML element in the XML schema. Figure
1 is a block diagram of schema-driven parser generator system 120, according to one
embodiment of the present invention. As shown in Figure 1, schema-driven validating parser
system 120 includes:

(1) schema reader 102, which reads one or more XML schema documents
(e.g., XML schema document 100);

(2) XML reader 112, which reads XML documents (e.g., XML document
110);

(3) XML parser integrator 104, which coordinates among the different
components of the XML parser (in this example, the XML parser consists of (a) XML
parser integrator 104, (b) XML parser generator 106, and (c) parselets (e.g., parselet
108)); and

(4) XML output module 114, which outputs validated, parsed XML elements
of the XML document for use by XML application program 116.

Parselets (e.g., parselet 108) are executable parsing codes created by XML parser
generator 106 based on XML elements read by schema reader 102 from XML schema
document 100. Each parselet is called to parse an XML element in an XML document (e.g.,
XML document 110) and to extract element and attribute values from the XML element and
all its included XML elements. The parsed elements are output by XML output module 114
to XML application program 116. Schema-driven XML parser system 120 is especially
beneficial when multiple XML documents are based on the same underlying XML schema,
so that each parselet can be re-used multiple times.

XML schema document 100 may be a document retrieved from a local storage or
from a remote storage via a network (e.g., the Internet) using one of various network
transport protocols (e.g., FTP, HTTP, and SSL). Schema reader 102 reads the XML elements
from XML schema document 100 one element at a time. In this embodiment, if an element is
nested (i.e., it contains other XML elements), the contained elements are read before reading
of the containing element is complete. Schema reader 102 may implement a push style or a
pull style of reading XML elements. Under a push style, schema reader 102 continuously
reads XML schema document 100 until an entire element is read, whereupon schema reader
102 notifies XML parser integrator 104. Under a pull style, XML parser integrator 104
requests that schema reader 102 read the next XML element from schema document 100.

When an entire element is completely read, XML parser integrator 104 causes XML
parser generator 106 to generate a corresponding parselet for the XML element read. XML
parser integrator 104 maintains a mapping table, which includes all relationships between a
parselet, all its containing parselets and all the parselets it contains. In addition, the mapping
table also records a name space and a qualified name for each parselet (a qualified name is
typically a prefix encoding a path to the parselet).
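One possible shape for such a mapping table is sketched below. The interface "ElementParser", the class "ParseletTable", the "{namespace}name" key format and the namespace "urn:po" used in the usage note are all assumptions made for illustration; the sketch records only the containing parselet of each entry, whereas the table described above also records contained parselets.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a generated parselet: turns element text into a parsed form.
interface ElementParser {
    String parse(String text);
}

// Hypothetical sketch of the integrator's mapping table.
final class ParseletTable {
    private final Map<String, ElementParser> byName = new HashMap<>();
    private final Map<String, String> parentOf = new HashMap<>();

    /** Builds a lookup key from a name space and an element name. */
    static String key(String nameSpace, String name) {
        return "{" + nameSpace + "}" + name;
    }

    /** Registers a parselet and, when given, the element that contains it. */
    void register(String nameSpace, String name, String parentName, ElementParser parselet) {
        String k = key(nameSpace, name);
        byName.put(k, parselet);
        if (parentName != null) {
            parentOf.put(k, key(nameSpace, parentName));
        }
    }

    /** Selects the parselet for an element, or null when none is registered. */
    ElementParser lookup(String nameSpace, String name) {
        return byName.get(key(nameSpace, name));
    }

    /** Key of the containing parselet, or null for a top-level element. */
    String parentKey(String nameSpace, String name) {
        return parentOf.get(key(nameSpace, name));
    }
}
```

For example, after registering "shipTo" with parent "purchaseOrder" under the assumed name space "urn:po", a lookup for ("urn:po", "shipTo") returns that parselet, and a lookup under any other name space returns null.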

To illustrate an application of the present invention, an example of an XML document
is provided in a "PurchaseOrder" document shown in Appendix A. As shown in Appendix A,
"purchaseOrder" is an XML element which includes other XML elements "shipTo", "billTo",
"comment" and "items". Elements "shipTo" and "billTo" each include instances of elements
"name", "street", "city", "state" and "zip". Element "items" may include one or more
instances of element "item." Element "item" may include instances of elements
"productName", "quantity", "USPrice", "comment" and "shipDate". One or more attributes
may be found in each element, whose values are provided by a string representing the
appropriate data type. For example, element "purchaseOrder" includes attribute "orderDate"
and element "shipTo" includes attribute "country". Appendix A is the form of the document
that is typically exchanged between the client (e.g., a web browser) and the application
program. The corresponding schema document is shown in Appendix B.

Appendix B shows a schema which includes, at the top level, elements
"purchaseOrder" and "comment". Element "comment" is defined in the schema to be a
string. Element "purchaseOrder" is defined in the schema to be an element of the data type
"PurchaseOrderType", which is defined to include elements, sequentially, "shipTo", "billTo",
"comment" and "items", and attribute "orderDate" as already seen in Appendix A. The term
"sequence" indicates the order in which the elements appear in the schema is expected to be
the order in which those elements appear in the XML document. The schema further defines
that the elements "shipTo" and "billTo" are both of the data type "USAddress", which is
defined to include elements "name", "street", "city", "state" and "zip", as also seen in
Appendix A. The schema defines "name", "street", "city" and "state" each to be a string and
"zip" to be of the data type "Decimal". Similarly, element "items" is defined in the schema
to include zero or more instances of element "item". Element "item" includes, sequentially,
elements "productName", "quantity", "USPrice", "comment" and "shipDate". Element
"item" also has an attribute "partNum" which is of the data type "SKU", a format for a part
number which is also defined in the schema. The data type of element "item" is not
provided a name. Using the information specified in the schema of Appendix B, parser
generator 106 generates the parse code for parsing the elements as they appear in the XML
document. One example of the parse code for element "purchaseOrder" is shown in
Appendix C.

Thus, XML parser generator 106 generates a parselet for the element "purchaseOrder"
which also includes code generated for parsing all the included elements. Note that, in this
embodiment, as the simple data types "string", "Decimal" and "Integer" and the various
variations of the data type "Date" are encountered frequently in XML documents, the parse
code for these data types is not generated specifically for every schema. Rather, a base class
"Parselet" is provided, and the specifically generated parselets, such as "purchaseOrder", are
derived from the "Parselet" class, so that parse codes for these common data types are
associated with every specifically generated parselet. An example of the "Parselet" class is
provided in Appendix D.
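The division of labor between a shared base class and a generated subclass may be sketched as follows. This is not the Appendix D listing: the names "BaseParselet", "SchemaViolation" and "QuantityParselet" are hypothetical, and the range check merely mirrors the "quantity" restriction of Appendix B (a positive integer with an exclusive maximum of 100).

```java
import java.math.BigDecimal;

// Hypothetical exception type standing in for a schema validation failure.
class SchemaViolation extends Exception {
    SchemaViolation(String message) {
        super(message);
    }
}

// Hypothetical base class: parse code for common simple types is written once here.
abstract class BaseParselet {
    protected BigDecimal parseDecimal(String text) throws SchemaViolation {
        try {
            return new BigDecimal(text.trim());
        } catch (NumberFormatException e) {
            throw new SchemaViolation("not a decimal: " + text);
        }
    }

    protected int parseInteger(String text) throws SchemaViolation {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            throw new SchemaViolation("not an integer: " + text);
        }
    }
}

// Hypothetical generated subclass: adds only the schema-specific constraint.
final class QuantityParselet extends BaseParselet {
    int parseQuantity(String text) throws SchemaViolation {
        int quantity = parseInteger(text);
        // positiveInteger with maxExclusive 100: accept 1 through 99 only.
        if (quantity < 1 || quantity >= 100) {
            throw new SchemaViolation("quantity out of range: " + quantity);
        }
        return quantity;
    }
}
```

In this sketch a quantity of "42" is accepted, while "100" or "0" raises the hypothetical SchemaViolation, so validation happens in the same step as parsing.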

In Appendix D, the methods for parsing data types "string", "Decimal" and "Integer"
and the various variations of "Date" are provided as "parseString", "parseDecimal",
"parseInteger" and "parseDate", respectively. In addition, methods for validating elements
and attributes (e.g., "isElement" and "isAttribute") are also provided in class "Parselet".
During parsing, "Parselet" keeps track of its progress through the XML document (i.e.,
where in the XML document the current text object being parsed is located) by the method
"EEMoveCursor". Error condition or "exception" handler "InvalidSchema" may be called
from "Parselet". An example of "InvalidSchema" is provided in Appendix E. As discussed
above, the output from the parsing operation may be a DOM tree, which is built from a
number of DOM nodes interconnected from a root node. An example of some pseudocode
for creating the DOM tree is provided in the class "Node" in Appendix F.

Returning to Appendix C (i.e., the listing of generated parselet "purchaseOrder"),
according to the structure of the XML document "purchaseOrder" as defined in the XML
schema, parselet "purchaseOrder" parses both elements "purchaseOrder" and "comment" at
the top level of the schema. To parse element "purchaseOrder", the method
"parsePurchaseOrderType" is called to handle the data type "PurchaseOrderType". Method
"parsePurchaseOrderType" parses, sequentially, the required attribute "orderDate", and each
of elements "shipTo", "billTo", "comment" and "items". As elements "shipTo" and "billTo"
are both of the same data type "USAddress", parsing of each element is handled by method
"parseUSAddress". Element "items" is handled by method "parseItems", which is also
generated according to the structure defined in the schema. As the data type of element
"item" contained in element "items" is not given a name, XML parser generator 106 gives the
method for handling this data type the name "parseUnnamed1". Method "parseUnnamed1"
parses, sequentially, the required attribute "partNum" and elements "productName",
"quantity", "USPrice", "comment" and "shipDate". Note that the parsing code also validates
each element, including testing if the supplied values are within the accepted range of values
for each element. Attribute "partNum" is parsed using the generated method "parseSKU".
As each element is successfully parsed, a node corresponding to the element is added to the
parse tree using method "addChild" in the class "Node" defined in Appendix F.

During actual parsing operation, when an element is read by XML reader 112 from
XML document 110, a corresponding parselet (say, parselet 108) is selected from the mapping
table, based on the name space, the element's qualified name, and the relationships involving
parselet 108. When parselet 108 completes its parsing task, the parsed XML element is
forwarded to parser integrator 104, which passes the parsed element to XML output module
114.

According to one embodiment, parsed XML elements output from the parselets are
validated against the elements' definitions in their respective schemas. Here, the term
"parsed" may mean either (1) that the XML element has been converted into a structured data
representation, such as a DOM tree, or (2) that the textual XML element has been validated
by parselet 108 to conform to its corresponding definition in the schema document. In the case
of a DOM tree, an application program can directly access the XML element through XML
output module 114. In the alternative case, validating the textual XML element without
more requires less processing time and less memory than building a DOM tree.

Figure 2 is a flow chart illustrating the operations of schema-driven parser generator
system 120 under push mode. As shown in Figure 2, at step 200, XML reader 112 reads
XML document 110 one element at a time. When an XML element is completely read, XML
reader 112 generates an event, which is assigned a sequence number at step 202. The
sequence number may be provided in ascending order by a counter. At step 204, XML reader
112 notifies XML parser integrator 104 of the event. Upon notification of the event, at step
206, XML parser integrator 104 examines and uses the name space and qualified name of the
XML element associated with the event to select from the mapping table an appropriate
parselet. At step 208, the selected parselet parses and validates the XML element. The
parselet may provide its output in a DOM tree structure, or simply provide a textual
representation of the XML element, according to the requirement of XML application
program 116. At step 210, the selected parselet notifies XML parser integrator 104 of the
parsed XML element. XML parser integrator 104 then passes the result of the parsing, together
with information regarding the event, to XML output module 114 at step 212. XML output
module 114 maintains a queue of parsed elements ordered by event sequence number.
At step 214, XML output module 114 notifies application program 116 of the parsed element
being added to the queue.
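The role of the event sequence numbers in the output queue may be illustrated by the following sketch. The classes "ParsedEvent" and "OutputQueue" are hypothetical, and the sketch assumes that results may reach the output module out of order (for instance, if several parselets run in parallel), in which case an element is released to the application only when all earlier sequence numbers have been delivered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical record of one parsed element and the sequence number of its read event.
final class ParsedEvent implements Comparable<ParsedEvent> {
    final long seq;
    final String element;

    ParsedEvent(long seq, String element) {
        this.seq = seq;
        this.element = element;
    }

    @Override
    public int compareTo(ParsedEvent other) {
        return Long.compare(seq, other.seq);
    }
}

// Hypothetical sketch of an output queue kept in event sequence order.
final class OutputQueue {
    private final PriorityQueue<ParsedEvent> pending = new PriorityQueue<>();
    private long nextSeq = 0; // sequence number the application expects next

    /** Accepts a result in any order and returns the elements that are now deliverable in order. */
    List<String> offer(ParsedEvent event) {
        pending.add(event);
        List<String> ready = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek().seq == nextSeq) {
            ready.add(pending.poll().element);
            nextSeq++;
        }
        return ready;
    }
}
```

With this sketch, offering event 1 before event 0 delivers nothing at first; once event 0 arrives, both elements are delivered in document order.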

Figure 3 summarizes the operations of schema-driven parser generator system 120 
under push mode. 

Figure 4 is a flow chart illustrating the operations of schema-driven parser generator
system 120 under pull mode. As shown in Figure 4, in the pull mode, XML application
program 116 initiates the parsing and validating process by asking the XML output module
114 for the next XML element at step 300. XML output module 114 in turn requests the next
XML element from XML parser integrator 104 at step 302. At step 304, XML parser
integrator 104 then requests that XML reader 112 read the next XML element from XML
document 110, which is accomplished at step 306. At step 308, XML reader 112 passes the
XML element read to XML parser integrator 104, which selects at step 310 corresponding
parselet 108 for validation and parsing based on the XML element's name space and qualified
name. At step 312, selected parselet 108 then parses and validates the XML element. As in
the push mode, the parsed XML element may be represented in a DOM tree structure, or in a
textual representation, according to the requirements of application program 116. Parselet
108 provides the parsed XML element to XML parser integrator 104 at step 314. At step 316,
XML parser integrator 104 provides the parsed XML element to XML output module 114,
which provides the parsed XML element to XML application program 116 at step 318.
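The on-demand character of the pull mode may be sketched as follows. The class "PullPipeline" is hypothetical; it collapses the output module, integrator and reader into one object, stands in for the reader with an Iterator and for the selected parselet with a UnaryOperator, and counts reads only to make the laziness visible.

```java
import java.util.Iterator;
import java.util.function.UnaryOperator;

// Hypothetical sketch: nothing is read or parsed until the application asks for it.
final class PullPipeline implements Iterator<String> {
    private final Iterator<String> reader;        // stands in for XML reader 112
    private final UnaryOperator<String> parselet; // stands in for the selected parselet
    private int elementsRead = 0;

    PullPipeline(Iterator<String> reader, UnaryOperator<String> parselet) {
        this.reader = reader;
        this.parselet = parselet;
    }

    @Override
    public boolean hasNext() {
        return reader.hasNext();
    }

    /** Reads exactly one element from the source, parses it, and returns the result. */
    @Override
    public String next() {
        String raw = reader.next();
        elementsRead++;
        return parselet.apply(raw);
    }

    /** Number of elements pulled from the reader so far. */
    int elementsRead() {
        return elementsRead;
    }
}
```

In this sketch, constructing the pipeline reads nothing; each call to next() by the application causes exactly one element to be read and parsed.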

Figure 5 summarizes the operations of schema-driven parser generator system 120
under pull mode.

According to another aspect of the present invention, Figure 6 shows dynamically
reconfigurable parser system 620. As shown in Figure 6, reconfigurable XML parser 603
uses schema analyzer 601 to obtain in lexicographical order XML elements defined in a
schema document 600. Reconfigurable parser 603 then compares every pair of
adjacent XML elements to determine a minimal lexicographical distance between them.
These minimal lexicographical distances guide reconfigurable parser 603 to identify XML
elements during parsing. That is, reconfigurable parser 603 need only perform pattern-matching
sufficient to recognize the attribute or element being parsed. In addition to minimal
lexicographical distances, schema analyzer 601 also provides state-transition information,
such as a list of possible next elements that may appear in the XML document, as determined
based on the current state.
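The state-transition information mentioned above may be illustrated for the simplest case, a schema "sequence" whose children each occur at most once. The class "SequenceTransitions" and the "<start>" state label are hypothetical; repeated elements (maxOccurs greater than one) and choice groups are outside this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: which elements may follow each element of a sequence.
final class SequenceTransitions {

    /**
     * names[i] is the i-th child of the sequence; optional[i] is true when that
     * child is declared with minOccurs="0". The returned map gives, for each
     * state (the start state or the element just parsed), the possible next elements.
     */
    static Map<String, List<String>> build(String[] names, boolean[] optional) {
        Map<String, List<String>> next = new LinkedHashMap<>();
        for (int i = -1; i < names.length; i++) {
            List<String> allowed = new ArrayList<>();
            for (int j = i + 1; j < names.length; j++) {
                allowed.add(names[j]);
                if (!optional[j]) {
                    break; // a required element cannot be skipped
                }
            }
            next.put(i < 0 ? "<start>" : names[i], allowed);
        }
        return next;
    }
}
```

Applied to the children of "PurchaseOrderType" in Appendix B ("shipTo", "billTo", optional "comment", "items"), the sketch reports that "billTo" may be followed by "comment" or "items", so only those two names need to be distinguished at that point.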

Based on the information provided by schema analyzer 601, reconfigurable parser 603
parses XML document 604. According to the present invention, reconfigurable parser 603
manages a number of element pools 605, which are created at system initialization according
to the expected elements to be encountered. Each element pool includes a number of
pre-allocated data structures ("XML element objects") created from a template of an expected XML
element. As each element is successfully parsed, an XML element object is retrieved from
the appropriate element pool and assigned as a node in a parse tree. Element pools 605 are
resizable and can vary in size dynamically, as reconfigurable parser 603 parses XML
document 604, according to the size and complexity of XML document 604.

In one embodiment, application program 602 invokes XML parser 603 to parse XML
document 604. Initially, reconfigurable parser 603 identifies the references in XML
document 604 to XML elements defined in XML schema document 600. The schema
references are provided to schema analyzer 601 to retrieve previously extracted lexicographical
and state-transition information corresponding to these references. Reconfigurable parser
603 then parses the XML references, requests XML element objects corresponding to the
parsed elements from XML element pools 605, fills the XML element objects returned with
the parsed data, and links the XML element objects to a parse tree. When all XML references are
parsed, the parse tree is provided to application program 602.

The above detailed description is provided to illustrate the specific embodiments of
the present invention and is not intended to be limiting. Numerous modifications and
variations within the present invention are possible. The present invention is set forth in the
following claims.
Appendix A

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>
        <item partNum="926-AA">
            <productName>Baby Monitor</productName>
            <quantity>1</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>
    </items>
</purchaseOrder>
Appendix B

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

  <xsd:element name="comment" type="xsd:string"/>

  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>

  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <xsd:element name="state" type="xsd:string"/>
      <xsd:element name="zip" type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN"
                   fixed="US"/>
  </xsd:complexType>

  <xsd:complexType name="Items">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="productName" type="xsd:string"/>
            <xsd:element name="quantity">
              <xsd:simpleType>
                <xsd:restriction base="xsd:positiveInteger">
                  <xsd:maxExclusive value="100"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <xsd:element name="USPrice" type="xsd:decimal"/>
            <xsd:element ref="comment" minOccurs="0"/>
            <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="partNum" type="SKU" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Stock Keeping Unit, a code for identifying products -->
  <xsd:simpleType name="SKU">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}-[A-Z]{2}"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
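
The two simple-type restrictions in the schema of Appendix B (the quantity range and the SKU pattern) can be illustrated with a short Java sketch. This is not part of the original listings: the class name SimpleTypeChecks and its methods are hypothetical, and the sketch assumes ASCII digits only, whereas the \d of XML Schema also admits other Unicode digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original appendices.
class SimpleTypeChecks {
    // Pattern facet of the SKU type: three digits, a hyphen, two capital letters.
    // matches() tests the whole string, mirroring the implicit anchoring of
    // XML Schema patterns.
    private static final Pattern SKU = Pattern.compile("\\d{3}-[A-Z]{2}");

    static boolean isValidSku(String s) {
        return SKU.matcher(s).matches();
    }

    // quantity is an xsd:positiveInteger with maxExclusive 100,
    // so the permitted values are 1 through 99.
    static boolean isValidQuantity(int q) {
        return q >= 1 && q < 100;
    }
}
```

The range 1 to 99 derived here is the same pair of bounds that the generated code of Appendix C passes to parseInteger.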
Appendix C

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss.examples;

import com.docomo.ss.InvalidSchema;
import com.docomo.ss.Node;
import com.docomo.ss.Parselet;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public class PurchaseOrder extends Parselet {

    private static char[] ename_purchaseOrder = {'p', 'u', 'r', 'c', 'h',
            'a', 's', 'e', 'O', 'r', 'd',
            'e', 'r'};
    private static char[] ename_comment = {'c', 'o', 'm', 'm', 'e', 'n',
            't'};

    public Node parse(char[] d, int off) {
        doc = d;
        offset = off;

        Node res = Node.create();
        int i = offset++;
        while (true) {
            try {
                if (doc[i] == 'p' &&
                        isElementFrom1(ename_purchaseOrder)) {
                    increaseOffset(ename_purchaseOrder);
                    res.addChild(parsePurchaseOrderType("purchaseOrder"));
                } else if (doc[i] == 'c' &&
                        isElementFrom1(ename_comment)) {
                    increaseOffset(ename_comment);
                    res.addChild(parseString(ename_comment,
                            "comment"));
                } else throw new InvalidSchema();
                // try to check validity and exit condition here
            } catch (Exception e) {
                e.printStackTrace();
                break;
            }
        }
        return res;
    }

    private static char[] ename_shipTo = {'s', 'h', 'i', 'p', 'T', 'o'};
    private static char[] ename_billTo = {'b', 'i', 'l', 'l', 'T', 'o'};
    private static char[] ename_items = {'i', 't', 'e', 'm', 's'};
    private static char[] aname_orderDate = {'o', 'r', 'd', 'e', 'r', 'D',
            'a', 't', 'e'};

    Node parsePurchaseOrderType(String name) throws InvalidSchema {
        Node res = Node.create();

        // Parse attributes
        if (isAttribute(aname_orderDate)) {
            increaseOffset(aname_orderDate, 2);
            res.addAttribute(parseDate(), "orderDate");
        }

        // Skip ahead to the start of the first child element
        while (doc[offset++] != '<');

        // parse a sequence of elements
        if (isElement(ename_shipTo)) {
            increaseOffset(ename_shipTo);
            res.addChild(parseUSAddress("shipTo"));
        } else throw new InvalidSchema();

        if (isElement(ename_billTo)) {
            increaseOffset(ename_billTo);
            res.addChild(parseUSAddress("billTo"));
        } else throw new InvalidSchema();

        if (isElement(ename_comment)) {
            increaseOffset(ename_comment);
            res.addChild(parseString(ename_comment, "comment"));
        }

        if (isElement(ename_items)) {
            increaseOffset(ename_items);
            res.addChild(parseItems("items"));
        } else throw new InvalidSchema();

        //tail processing TODO
        if (!EEMoveCursor(ename_purchaseOrder))
            throw new InvalidSchema();
        return res;
    }


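    private static char[] ename_name = {'n', 'a', 'm', 'e'};
    private static char[] ename_street = {'s', 't', 'r', 'e', 'e', 't'};
    private static char[] ename_city = {'c', 'i', 't', 'y'};
    private static char[] ename_state = {'s', 't', 'a', 't', 'e'};
    private static char[] ename_zip = {'z', 'i', 'p'};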

    Node parseUSAddress(String name) throws InvalidSchema {
        Node res = Node.create();

        // Parse attributes TODO

        // Parse sequence
        if (isElement(ename_name)) {
            increaseOffset(ename_name);
            res.addChild(parseString(ename_name, "name"));
        } else throw new InvalidSchema();

        if (isElement(ename_street)) {
            increaseOffset(ename_street);
            res.addChild(parseString(ename_street, "street"));
        } else throw new InvalidSchema();

        if (isElement(ename_city)) {
            increaseOffset(ename_city);
            res.addChild(parseString(ename_city, "city"));
        } else throw new InvalidSchema();

        if (isElement(ename_state)) {
            increaseOffset(ename_state);
            res.addChild(parseString(ename_state, "state"));
        } else throw new InvalidSchema();

        if (isElement(ename_zip)) {
            increaseOffset(ename_zip);
            res.addChild(parseDecimal(ename_zip, "zip"));
        } else throw new InvalidSchema();

        //tail processing TODO
        return res;
    }

    private static char[] ename_item = {'i', 't', 'e', 'm'};

    Node parseItems(String name) throws InvalidSchema {
        Node res = Node.create();

        //parse sequence
        while (true) {
            if (isElement(ename_item)) {
                increaseOffset(ename_item);
                res.addChild(parseUnnamed1("item"));
            } else break;
        }

        // tail processing TODO
        return res;
    }

    private static char[] ename_productName = {'p', 'r', 'o', 'd', 'u',
            'c', 't', 'N', 'a', 'm', 'e'};
    private static char[] ename_quantity = {'q', 'u', 'a', 'n', 't', 'i',
            't', 'y'};
    private static char[] ename_USPrice = {'U', 'S', 'P', 'r', 'i', 'c',
            'e'};
    private static char[] ename_shipDate = {'s', 'h', 'i', 'p', 'D', 'a',
            't', 'e'};


    Node parseUnnamed1(String name) throws InvalidSchema {
        Node res = Node.create();

        //processing attribute TODO

        //processing sequence
        if (isElement(ename_productName)) {
            increaseOffset(ename_productName);
            res.addChild(parseString(ename_productName,
                    "productName"));
        } else throw new InvalidSchema();

        if (isElement(ename_quantity)) {
            increaseOffset(ename_quantity);
            res.addChild(parseInteger(1, 99, ename_quantity));
        } else throw new InvalidSchema();

        if (isElement(ename_USPrice)) {
            increaseOffset(ename_USPrice);
            res.addChild(parseDecimal(ename_USPrice, "USPrice"));
        } else throw new InvalidSchema();

        if (isElement(ename_comment)) {
            increaseOffset(ename_comment);
            res.addChild(parseString(ename_comment, "comment"));
        }

        if (isElement(ename_shipDate)) {
            increaseOffset(ename_shipDate);
            res.addChild(parseDate(ename_shipDate, "shipDate"));
        }

        //tail processing TODO
        return res;
    }
}
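
The pattern used throughout Appendix C (test for the next element name that the schema allows, advance the cursor, extract the content) can be tried in isolation with the following self-contained sketch. It is an illustration only, not the generated code itself: the class SequenceParseletSketch and its methods are hypothetical names, elements are assumed to carry no attributes, and, as in the listing above when verification is switched off, the end tag is skipped rather than checked.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; not one of the patent's classes.
class SequenceParseletSketch {
    private final char[] doc;
    private int offset;

    SequenceParseletSketch(String xml) {
        doc = xml.toCharArray();
    }

    // True if "<name>" begins at the current offset.
    private boolean startsElement(String name) {
        if (offset + name.length() + 1 >= doc.length || doc[offset] != '<') return false;
        for (int i = 0; i < name.length(); i++)
            if (doc[offset + 1 + i] != name.charAt(i)) return false;
        return doc[offset + 1 + name.length()] == '>';
    }

    // Reads the text content of the element and moves past its end tag.
    private String parseText(String name) {
        offset += name.length() + 2;              // skip "<name>"
        int start = offset;
        while (offset < doc.length && doc[offset] != '<') offset++;
        String text = new String(doc, start, offset - start);
        offset += name.length() + 3;              // skip "</name>" without checking it
        return text;
    }

    // Parses the elements in exactly the order given by the schema sequence.
    Map<String, String> parseSequence(String... expected) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String name : expected) {
            if (!startsElement(name))
                throw new IllegalArgumentException(
                        "expected <" + name + "> at offset " + offset);
            out.put(name, parseText(name));
        }
        return out;
    }
}
```

For example, new SequenceParseletSketch("<name>Alice Smith</name><city>Mill Valley</city>").parseSequence("name", "city") yields the two text values in order, while asking for "city" first is rejected at offset 0. Because the expected order is known in advance, only one name comparison is needed per element.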
Appendix D

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public abstract class Parselet {
    public static boolean verify = false;
    protected char[] doc;
    protected int offset;

    public abstract Node parse(char[] doc, int offset);

    protected boolean isElementFrom1(char[] element) {
        for (int i = 1; i < element.length; i++)
            if (doc[offset + i] != element[i]) return false;
        return true;
    }

    protected boolean isElement(char[] element) {
        for (int i = 0; i < element.length; i++)
            if (doc[offset + i] != element[i]) return false;
        if (doc[offset + element.length] == ' ' ||
                doc[offset + element.length] == '>')
            return true;
        else return false;
    }

    protected boolean isAttribute(char[] an) {
        for (int i = 0; i < an.length; i++)
            if (doc[offset + i] != an[i]) return false;
        return true;
    }

    protected Node parseString(char[] name, String ename) throws
            InvalidSchema {
        if (doc[offset] != '>') throw new InvalidSchema();
        int start = ++offset;
        // Advance to the '<' that ends the text content
        while (doc[offset] != '<' || doc[offset - 1] == '&') offset++;
        Node res = Node.create(ename, new String(doc, start, offset -
                start));
        return res;
    }

    protected Node parseDecimal(char[] name, String ename) throws
            InvalidSchema {
        System.out.println("Parselet.parseDecimal not implemented!");
        return null;
    }

    protected Node parseInteger(int min, int max, char[] name) throws
            InvalidSchema {
        System.out.println("Parselet.parseInteger not implemented!");
        return null;
    }

    protected Node parseDate(char[] name, String ename) throws
            InvalidSchema {
        System.out.println("Parselet.parseDate not implemented!");
        return null;
    }

    protected Node parseDate() throws InvalidSchema {
        System.out.println("Parselet.parseDate not implemented!");
        return null;
    }

    protected void increaseOffset(char[] ename) {
        offset += ename.length;
    }

    protected void increaseOffset(char[] ename, int additional) {
        offset += ename.length + additional;
    }


    protected boolean EEMoveCursor(char[] ename) {
        if (verify) {
            if (doc[offset] == '<' && doc[offset + 1] == '/') {
                int i;
                for (i = 0; i < ename.length; i++)
                    if (doc[offset + i + 2] != ename[i]) return
                            false;
                if (doc[offset + i + 2] != '>') return false;
            }
        }
        // Move past the end tag: "</" + name + ">"
        offset += ename.length + 3;
        return true;
    }
}
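
Appendix D leaves parseInteger as a stub. A minimal sketch of how a range-checked integer could be read directly from the character buffer is given below. It is an assumption for illustration, not the original implementation: the class IntegerFieldSketch and the method parseBoundedInt are hypothetical, the value is returned as an int rather than wrapped in a Node, and only unsigned decimal digits are accepted.

```java
// Hypothetical illustration; not part of the original Parselet class.
class IntegerFieldSketch {
    /**
     * Parses the unsigned decimal integer that begins at start and ends
     * before the next '<', rejecting values outside [min, max].
     */
    static int parseBoundedInt(char[] doc, int start, int min, int max) {
        long value = 0;                 // long avoids overflow before the range check
        int i = start;
        while (i < doc.length && doc[i] != '<') {
            char c = doc[i++];
            if (c < '0' || c > '9')
                throw new IllegalArgumentException("not a digit: " + c);
            value = value * 10 + (c - '0');
            if (value > max)
                throw new IllegalArgumentException("value above " + max);
        }
        if (i == start || value < min)
            throw new IllegalArgumentException("empty value or value below " + min);
        return (int) value;
    }
}
```

With the bounds 1 and 99 that Appendix C passes for the quantity element, the buffer "42</quantity>" parses to 42, while "100</quantity>" and "0</quantity>" are rejected. Checking the range while scanning means an out-of-range value is detected without a separate validation pass.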
Appendix E

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public class InvalidSchema extends Exception {

}
Appendix F

/*
 * Created on Aug 17, 2004
 *
 * To change the template for this generated file go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
package com.docomo.ss;

/**
 * @author zhou
 *
 * To change the template for this generated type comment go to
 * Window - Preferences - Java - Code Generation - Code and Comments
 */
public abstract class Node implements DOMNode {
    public static Node create() {
        System.out.println("Node.create not implemented!");
        return null;
    }

    public static Node create(String name, Object o) {
        System.out.println("Node.create not implemented!");
        return null;
    }

    public abstract void addChild(Node n);

    public abstract void addAttribute(Node n, String name);
}
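
Because the factory methods of Appendix F are stubs and the DOMNode interface is not shown, the following sketch gives one possible concrete node with the same four operations. It is an assumption made for illustration only: the class SimpleNode, its fields, and its accessor methods are hypothetical and do not implement DOMNode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical concrete node; mirrors create/addChild/addAttribute of Appendix F.
class SimpleNode {
    final String name;      // element or attribute name, null for an anonymous node
    final Object value;     // parsed value, null for a pure container node
    private final List<SimpleNode> children = new ArrayList<>();
    private final Map<String, SimpleNode> attributes = new LinkedHashMap<>();

    private SimpleNode(String name, Object value) {
        this.name = name;
        this.value = value;
    }

    static SimpleNode create() {
        return new SimpleNode(null, null);
    }

    static SimpleNode create(String name, Object o) {
        return new SimpleNode(name, o);
    }

    void addChild(SimpleNode n) {
        children.add(n);
    }

    void addAttribute(SimpleNode n, String name) {
        attributes.put(name, n);
    }

    // Read-only view of the children in document order.
    List<SimpleNode> children() {
        return Collections.unmodifiableList(children);
    }

    SimpleNode attribute(String name) {
        return attributes.get(name);
    }
}
```

Keeping children in an ordered list preserves the element order that the schema sequence prescribes.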
Claims

We claim: 

1 . A method for parsing an XML document, comprising: 

Performing an analysis of the XML document from an XML schema 
associated with the XML document to extract one or more relationships between 
XML elements included in the XML document; and 

parsing the XML elements of the XML document guided by relationships 
extracted in the analysis. 

2. A method as in Claim 1, wherein the relationships extracted from the analysis 
comprises a lexicographical distance between XML elements. 

3. A method as in Claim 1, wherein the relationships extracted from the analysis 
comprises state-transition information. 

4. A method as in Claim 1, further comprising providing element object pools
that are created upon system initialization. 

5. A method as in Claim 4, further comprising, upon parsing an XML element 
retrieving a corresponding element object from the element object pools; and 
filling the element object with values extracted from the parsed XML element. 

6. A method as in Claim 5, further comprising providing the element object as a
node in a parse tree. 

7. A method as in Claim 5, wherein the element object is returned to the element 
object pools. 

8. A method as in Claim 4, wherein each element object in the element object
pools correspond to an expected data structure of an XML element defined in the XML 
schema. 

9. A reconfigurable parser for an XML document, comprising: 

an analyzer for extracting from an XML schema associated with the XML
document relationships between XML elements included in the XML document; and 

a parser of the XML elements of the XML document guided by the 
relationships extracted by the analyzer.

10. A reconfigurable parser as in Claim 9, wherein the relationships extracted by 
the analyzer comprises a lexicographical distance between XML elements. 

11. A reconfigurable parser as in Claim 9, wherein the relationships extracted by
the analyzer comprises state-transition information.

12. A reconfigurable parser as in Claim 9, further comprising element object pools
that are created upon system initialization.

13. A reconfigurable parser as in Claim 12, further comprising:

a selector for retrieving a corresponding element object from the element
object pools, upon successfully parsing an XML element; and

a writer for filling in the element object with values extracted from the parsed 
XML element. 

14. A reconfigurable parser as in Claim 13, further comprising a parse tree 
constructor which receives the element object as a node in a parse tree. 

15. A reconfigurable parser as in Claim 13, wherein the element object is returned
to the element object pools.

16. A reconfigurable parser as in Claim 12, wherein each element object in the
element object pools correspond to an expected data structure of an XML element defined in
the XML schema.

17. A method for efficiently parsing an XML document, comprising:

analyzing a schema associated with the XML document to extract data 
structures of XML elements of the XML document; 

generating parse code for each data structure of the XML elements; and 

parsing the XML elements using the generated parse code as the XML 
document is read.

18. A method as in Claim 17, wherein the generated parse code is compiled.

19. A method as in Claim 17, further comprising reading the XML elements from
the XML document one element at a time.

20. A method as in Claim 19, wherein the XML elements are read into the parser
according to a push model.

21. A method as in Claim 19, wherein the XML elements are read into the parser
according to a pull model.

22. A method as in Claim 17, further comprising validating the XML elements
against the XML schema.

23. A method as in Claim 17, further comprising providing the parsed XML
elements in a parse tree.

24. A method as in Claim 17, further comprising providing the parsed XML
elements one at a time in a continuous stream.

25. A parser for an XML document, comprising: 

a schema analyzer for extracting data structures of XML elements from a 
schema associated with the XML document; 

a parse code generator that generates a parse code for each data structure of 
the XML elements; and

a parser integrator that invokes a corresponding parse code in response to each
XML element encountered as the XML document is read.

26. A parser as in Claim 25, wherein the generated parse code is compiled.

27. A parser as in Claim 25, further comprising an XML reader that reads the
XML elements from the XML document one element at a time.

28. A parser as in Claim 27, wherein the XML elements are read into the parser
according to a push model.

29. A parser as in Claim 27, wherein the XML elements are read into the parser
according to a pull model.

30. A parser as in Claim 25, wherein the parse code validates the XML elements
against the XML schema.

31. A parser as in Claim 25, further comprising an output module that provides the
parsed XML elements in a parse tree.

32. A parser as in Claim 25, further comprising an output module that provides the
parsed XML elements one at a time in a continuous stream.
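
The element object pools recited in Claims 4 to 8 and 12 to 16 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class ElementObjectPoolSketch, the nested ElementObject type, and the per-type counts supplied at construction are all hypothetical, and a pool that runs empty simply falls back to a fresh allocation.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of pre-allocated element object pools.
class ElementObjectPoolSketch {
    static final class ElementObject {
        final String type;   // schema type this object was allocated for
        String value;        // filled in when an element is parsed
        ElementObject(String type) { this.type = type; }
    }

    private final Map<String, ArrayDeque<ElementObject>> pools = new HashMap<>();

    // Pre-allocates the given number of objects per element type at initialization.
    ElementObjectPoolSketch(Map<String, Integer> countsPerType) {
        for (Map.Entry<String, Integer> e : countsPerType.entrySet()) {
            ArrayDeque<ElementObject> pool = new ArrayDeque<>();
            for (int i = 0; i < e.getValue(); i++)
                pool.push(new ElementObject(e.getKey()));
            pools.put(e.getKey(), pool);
        }
    }

    // Retrieves an object for a parsed element and fills it with the parsed value.
    ElementObject acquire(String type, String value) {
        ArrayDeque<ElementObject> pool = pools.get(type);
        if (pool == null)
            throw new IllegalArgumentException("no pool for " + type);
        ElementObject o = pool.isEmpty() ? new ElementObject(type) : pool.pop();
        o.value = value;
        return o;
    }

    // Returns the object to its pool so that a later element can reuse it.
    void release(ElementObject o) {
        o.value = null;
        pools.get(o.type).push(o);
    }

    int available(String type) {
        return pools.get(type).size();
    }
}
```

A parser built this way would call acquire each time an element is successfully parsed and release once the element object is no longer needed, so that steady-state parsing performs few dynamic allocations.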


